simdutf_connector: in_tail: skip UTF-16/UTF-8 BOM#10328
simdutf_connector: in_tail: skip UTF-16/UTF-8 BOM#10328edsiper merged 5 commits intofluent:masterfrom
Conversation
|
Could you please take a look at this @cosmo0920? There are a few coding style issues but I'm more interested validating in the actual UTF stuff. |
cosmo0920
left a comment
There was a problem hiding this comment.
I tested the patch in my dev box and I got succeeded to convert from UTF-16-LE and UTF-16-BE to UTF-8. And I did use std::unique_ptr<char[]> to proceed allocate/deallocate automatically in the simdutf module. Changed from it to flb_malloc style was not considered TBH. This could be fine and there's nothing memory leaks.
|
Ah, jemalloc headers are not found in CI tasks. Need to investigate. |
It's fine to me for this change. Could you proceed to point out minor issues on your side? |
There was a problem hiding this comment.
One thing, could you add the following lines into end of the file here?
if(FLB_JEMALLOC)
target_link_libraries(flb-simdutf-connector-static ${JEMALLOC_LIBRARIES})
endif()
It seems jemalloc related error could be caused by missing dependency of jemalloc. So, we have to mark jemalloc as one of the dependencies of simdutf-connector.
Done. Thanks for swift feedback and help with deciphering the build errors. |
|
I identified the weird compilation errors on Windows here: |
leonardo-albertovich
left a comment
There was a problem hiding this comment.
I've left some change requests, please check all of the code for those issues I've pointed out, I tried not to add one note per incidence but noticed multiple occurrences of some of them (such as the missing exception handling).
plugins/in_tail/tail_file.c
Outdated
| else if (ret == FLB_UNICODE_CONVERT_NOP) { | ||
| flb_plg_debug(ctx->ins, "nothing to convert encoding '%.*s'", end - data, data); | ||
| /* Skip the UTF-8 BOM */ | ||
| if ((end - data) >= 3 && (data[0] & 0xFF) == 0xEF && (data[1] & 0xFF) == 0xBB |
There was a problem hiding this comment.
I think in this branch of the conditional the buffer has not changed and thus file->buf_len is still valid which means the conditional should be written as :
if (file->buf_len >= 3 &&
(data[0] & 0xFF) == 0xEF &&
(data[1] & 0xFF) == 0xBB &&
(data[2] & 0xFF) == 0xBF) {
Additionally, is there any reason for us not to define data as unsigned char * so we can simplify this?
There was a problem hiding this comment.
That seems accurate, I just took the expression from the line above. Changing it where, data originally comes from the flb_tail_file struct. In this case I think an easer solution would be to use char constants instead, e.g. '\xFF' which makes the integer promotion behave the same way for both the lhs and rhs of the comparison.
| result = simdutf::validate_utf8_with_errors(output.get(), clen); | ||
| if (result.error == simdutf::error_code::SUCCESS && converted > 0) { | ||
| std::string result_string(output.get(), clen); | ||
| *utf8_output = (char*)flb_malloc(clen + 1); |
There was a problem hiding this comment.
Please check the coding style guide for this and other issues.
In this case the missing spaces in the data type type and after the closing parenthesis of the cast.
| aligned_input = (const char16_t *)input; | ||
| } | ||
| else { | ||
| str16.resize(len / 2); |
There was a problem hiding this comment.
According to the C++ reference we are missing some exception handling here.
There was a problem hiding this comment.
Check, it might be be a similar amount of work (and more consistent with the rest of the C codebase) to just use flb_malloc here as well, since the simdutf functions used are noexcept-labelled.
|
Hi, could you rebase off the current master? |
- Do not copy input if data is already aligned. - Only allocate output once. Signed-off-by: Erik Cederberg <erik.cederberg@sectra.com>
When converting UTF-16 to UTF-8, ingore the BOM so that no UTF-8 BOM is written to the output. Signed-off-by: Erik Cederberg <erik.cederberg@sectra.com>
If unicode input data is not converted, check if there is a UTF-8 BOM present and skip it. Signed-off-by: Erik Cederberg <erik.cederberg@sectra.com>
Signed-off-by: Erik Cederberg <erik.cederberg@sectra.com>
cosmo0920
left a comment
There was a problem hiding this comment.
Could you add static attributes for OSSFuzz related codes?
diff --git a/include/fluent-bit/flb_mem.h b/include/fluent-bit/flb_mem.h
index 3c83580db..ea5eb62cc 100644
--- a/include/fluent-bit/flb_mem.h
+++ b/include/fluent-bit/flb_mem.h
@@ -50,8 +50,8 @@
/*
* Return 1 or 0 based on a probability.
*/
-int flb_malloc_p;
-int flb_malloc_mod;
+static int flb_malloc_p;
+static int flb_malloc_mod;
static inline int flb_fuzz_get_probability(int val) {
flb_malloc_p += 1;This could fix compilation errors for OSSFuzz.
Move definitions to its own file to avoid breaking the C++ one definition rule, improving the integration with the C++ code in flb-simdutf-connector. Signed-off-by: Erik Cederberg <erik.cederberg@sectra.com>
I did not see this until I after pushed my latest change, but I made them |
Ah, yes. Using |
This MR updates
simdutf_connectorto reduce the number of copies when converting UTF-16 to UTF-8 and to remove the UTF-16 BOM prior to conversion so that no UTF-8 BOM is present in the converted output.tail_fileis also updated to skip any encountered UTF-8 BOM if the unicode conversion returnsFLB_UNICODE_CONVERT_NOP.Enter
[N/A]in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-testlabel to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.